Wrangle & Analyze "WeRateDogs" Data

Table of Contents

  1. Introduction
  2. Gathering data
  3. Assessing data
  4. Cleaning data
  5. Storing, Analyzing, and Visualizing

Introduction

The purpose of this project is to demonstrate the skills learned in the data wrangling part of the Udacity Data Analyst Nanodegree program.

Gathering data:

1. The WeRateDogs Twitter archive:

The WeRateDogs Twitter archive is provided with the project, so it is treated as a file on hand. It is downloaded manually as twitter_archive_enhanced.csv.

2. Tweet image predictions

The tweet image predictions record what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and is downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
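
A minimal sketch of the programmatic download described above, assuming the Requests library is installed. The helper names filename_from_url and download_file are hypothetical, and the call itself is left commented out so nothing is fetched by accident.

```python
import os


def filename_from_url(url):
    """Return the last path segment of the URL, used as the local file name."""
    return url.rstrip("/").split("/")[-1]


def download_file(url, folder="."):
    """Download url into folder and return the local path (hypothetical helper)."""
    import requests  # third-party; imported here so the helper above works without it

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as f:
        f.write(response.content)
    return path


URL = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")
# download_file(URL)  # uncomment to fetch the file
```

The saved file is tab-separated, so it would be read with pd.read_csv(path, sep="\t").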

3. Twitter API File

Assessing data for this project

WeRateDogs Twitter archive assessment:

Quality in twitter_archive

  1. timestamp and retweeted_status_timestamp should be of datetime type instead of object.
  2. No duplicated rows were found.
  3. Missing values are present in the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and expanded_urls.
  4. Some of the values in the rating_numerator and rating_denominator columns are inconsistent or not acceptable. For example, the rating_denominator max value is 170 and the min value is 0.
  5. 639 expanded_urls entries contain more than one URL.
  6. Dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column.
  7. Tweet ID 835246439529840640 has a rating denominator of 0.
  8. Ordinary words were recorded as dog names: 'infuriating', 'just', 'life', 'light', 'mad', 'my', 'not', 'officially', 'old', 'one', 'quite', 'space', 'such', 'the', 'this', 'unacceptable', 'very'
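
The checks behind the quality issues above could be sketched as follows. The small DataFrame is made-up sample data that only mimics a few twitter_archive columns, and the lowercase-name test is an assumed heuristic, not necessarily the one used in the project.

```python
import pandas as pd

# Made-up sample rows that mimic a few twitter_archive columns.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000", "2017-07-29 16:00:24 +0000"],
    "in_reply_to_status_id": [None, None, 8.9e17, None],
    "rating_numerator": [13, 12, 960, 24],
    "rating_denominator": [10, 10, 0, 7],
    "name": ["Phineas", "a", "the", "Tilly"],
})

# Is timestamp stored as a real datetime, or still as text?
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive["timestamp"])
n_duplicates = archive.duplicated().sum()   # fully duplicated rows
null_counts = archive.isnull().sum()        # missing values per column
# Rows whose denominator differs from the usual 10
bad_denominators = archive[archive["rating_denominator"] != 10]
# Assumed heuristic: all-lowercase "names" are usually ordinary words
odd_names = archive.loc[archive["name"].str.islower(), "name"].tolist()
```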

Tidiness in twitter_archive

  1. twitter_archive acts as the main base table with the attributes above, while the other dataframes hold further attributes. Hence all dataframes need to be joined into one final dataframe.
  2. Dog 'stage' classification (doggo, floofer, pupper or puppo) should be one column.
  3. Some dogs have more than one category assigned.

Image Predictions assessment:

Quality & Tidiness Issues in Image Predictions:

  1. The dataset has 2075 rows and 12 columns.

  2. Column names are confusing and do not give much information about the content.

  3. Dog breeds contain underscores, and have different case formatting.

  4. 66 jpg_url duplicates were found.

  5. Dataset should be merged with the twitter archive dataset.
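
A short sketch of how the image prediction issues above could be checked. The rows are made-up sample data; the column names jpg_url and p1 are assumed to follow the original image_predictions.tsv layout.

```python
import pandas as pd

# Made-up sample rows; jpg_url and p1 are assumed column names.
images = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "jpg_url": ["https://example.com/a.jpg", "https://example.com/b.jpg",
                "https://example.com/a.jpg"],
    "p1": ["golden_retriever", "Chihuahua", "golden_retriever"],
})

shape = images.shape                                   # (rows, columns)
n_dup_urls = images["jpg_url"].duplicated().sum()      # repeated image URLs
n_underscore = images["p1"].str.contains("_", regex=False).sum()
n_lowercase = images["p1"].str.islower().sum()         # mixed case formatting
```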

Twitter API data assessment:

Quality & Tidiness Issues in Twitter API Data

  1. The Twitter API data has 2354 rows and 3 columns.

  2. The dataset should be merged with the twitter archive dataset.

Cleaning Data:

twitter_archive clean:

Define 1: Remove outliers in the rating_numerator and rating_denominator columns

Code 1:

Test 1:

Define 2: Drop unnecessary columns in twitter_archive_clean

Code 2 :

Test 2 :

Define 3: Convert tweet_id to string and timestamp to datetime

Code 3 :
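
A minimal sketch of this conversion, not necessarily the code used in the project. The two rows are illustrative values only.

```python
import pandas as pd

# Illustrative rows; only the two columns touched by this step are included.
archive_clean = pd.DataFrame({
    "tweet_id": [892420643555336193, 892177421306343426],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-08-01 00:17:27 +0000"],
})

# IDs are labels, not quantities, so store them as text.
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
# Parse the text timestamps into timezone-aware datetimes.
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])
```

Once timestamp is a datetime column, accessors such as archive_clean["timestamp"].dt.year become available.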

Test 3 :

Define 4: Combine the dog classification columns (doggo, floofer, pupper, puppo) into one single column.

Code 4 :
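
One way this step could look, as a sketch. It assumes that a missing stage is stored as the text "None" in the four columns, and it joins multiple stages with a comma so rows with more than one category stay visible. The helper combine_stages is hypothetical; dog_type is the column name used later in this report.

```python
import pandas as pd

# Made-up rows; a missing stage is assumed to be the text "None".
archive_clean = pd.DataFrame({
    "doggo":   ["doggo", "None", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "pupper", "None", "pupper"],
    "puppo":   ["None", "None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


archive_clean["dog_type"] = archive_clean[stage_cols].apply(combine_stages, axis=1)
archive_clean = archive_clean.drop(columns=stage_cols)  # the four originals are now redundant
```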

Test 4:

Define 5: Remove the rows which have a rating denominator greater than 10

Code 5:
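
A minimal sketch of this filter on made-up rows, assuming the cleaned frame is called archive_clean.

```python
import pandas as pd

# Made-up rows; the second one has a denominator above 10.
archive_clean = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [13, 84, 12],
    "rating_denominator": [10, 70, 10],
})

# Keep only rows whose denominator is at most 10.
keep = archive_clean["rating_denominator"] <= 10
archive_clean = archive_clean[keep].reset_index(drop=True)
```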

Test 5:

Define 6: Fill null values in the dog_type column with the mode value for each source

Code 6:
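
A sketch of a per-source mode fill, under the assumption that the mode is computed within each source group and then mapped back onto the rows. The source labels are simplified stand-ins for the real source strings, and most_common is a hypothetical helper.

```python
import pandas as pd

# Made-up rows; source values are simplified stand-ins.
archive_clean = pd.DataFrame({
    "source": ["iPhone", "iPhone", "iPhone", "Web", "Web"],
    "dog_type": ["pupper", "pupper", None, "doggo", None],
})


def most_common(values):
    """Return the most frequent non-null value, or None if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None


# One mode per source, then looked up for every row via its source.
mode_by_source = archive_clean.groupby("source")["dog_type"].agg(most_common)
archive_clean["dog_type"] = archive_clean["dog_type"].fillna(
    archive_clean["source"].map(mode_by_source)
)
```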

Test 6:

Image_predictions_df clean:

Define 1: Change column labels

Code 1:

Test 1:

Define 2: Remove underscore and capitalize the first letter of each word

Code 2:
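
A minimal sketch of this step. It uses the original column name p1 as an assumption, since the renamed labels from the previous step are not listed here; the same line would be repeated for the other prediction columns.

```python
import pandas as pd

# Made-up breed labels in the original, inconsistent formatting.
images_clean = pd.DataFrame({
    "p1": ["golden_retriever", "Chihuahua", "german_shepherd"],
})

# Replace underscores with spaces, then capitalize the first letter of each word.
images_clean["p1"] = (
    images_clean["p1"].str.replace("_", " ", regex=False).str.title()
)
```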

Test 2:

Define 3 : Build function to determine dog breed

Code 3:
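
One possible version of such a function, as a sketch. It assumes the original column layout (p1, p1_dog, p2, p2_dog, p3, p3_dog) and picks the highest-ranked prediction that is flagged as a dog. The function name predicted_breed is hypothetical; breed_predicted is the column name used later in this report.

```python
import pandas as pd

# Made-up predictions; pN is the label, pN_dog says whether it is a dog breed.
images_clean = pd.DataFrame({
    "p1": ["Golden Retriever", "Paper Towel", "Seat Belt"],
    "p1_dog": [True, False, False],
    "p2": ["Labrador Retriever", "Chihuahua", "Toaster"],
    "p2_dog": [True, True, False],
    "p3": ["Kuvasz", "Pug", "Banana"],
    "p3_dog": [True, True, False],
})


def predicted_breed(row):
    """Return the first prediction flagged as a dog, or None if there is none."""
    for rank in ("p1", "p2", "p3"):
        if row[f"{rank}_dog"]:
            return row[rank]
    return None


images_clean["breed_predicted"] = images_clean.apply(predicted_breed, axis=1)
```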

Test 3:

Twitter API clean:

Define 1: Remove outliers from the favorite_count and retweet columns

Code 1:
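
The outlier rule used in the project is not stated here, so the following is only a sketch under an assumed rule: the common 1.5 x IQR fence. The helper remove_iqr_outliers and the column name retweet_count are assumptions.

```python
import pandas as pd

# Made-up counts; the last favorite_count is an obvious outlier.
api_clean = pd.DataFrame({
    "favorite_count": [100, 120, 110, 130, 90, 100000],
    "retweet_count": [10, 12, 11, 13, 9, 14],
})


def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows that lie within k * IQR of the quartiles in every given column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


api_clean = remove_iqr_outliers(api_clean, ["favorite_count", "retweet_count"])
```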

Test 1:

Define 2 : Rename tweet_id column

Code 2 :

Test 2:

Define 3: Merge the 3 dataframes together

Code 3:
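
A sketch of the merge on tweet_id, using made-up frames. The inner join is an assumption: it keeps only tweets present in all three tables, and how="left" would instead keep every archive row.

```python
import pandas as pd

# Made-up frames; tweet_id is text in all three so the keys match.
archive_clean = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                              "rating_numerator": [13, 12, 11]})
images_clean = pd.DataFrame({"tweet_id": ["1", "2"],
                             "breed_predicted": ["Pug", "Chihuahua"]})
api_clean = pd.DataFrame({"tweet_id": ["1", "2", "3"],
                          "retweet_count": [10, 20, 30],
                          "favorite_count": [100, 200, 300]})

# Chain two merges; tweet 3 drops out because it has no image prediction.
master = (
    archive_clean
    .merge(images_clean, on="tweet_id", how="inner")
    .merge(api_clean, on="tweet_id", how="inner")
)
```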

Test 3:

Define 4: Drop rating_numerator and rating_denominator columns

Code 4 :

Test 4 :

Define 5: Fill null values in the rating column with the median rating of each breed_predicted group, using the transform method. The median is used because the column has outliers.

Code 5:
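
A minimal sketch of a group-wise median fill with transform, on made-up rows. It assumes rating is a numeric column and that the groups are defined by breed_predicted.

```python
import pandas as pd

# Made-up rows; two ratings are missing.
master = pd.DataFrame({
    "breed_predicted": ["Pug", "Pug", "Pug", "Chihuahua", "Chihuahua"],
    "rating": [1.0, 1.5, None, 1.3, None],
})

# transform returns one median per row (aligned with the original index),
# so it can be passed straight to fillna.
group_median = master.groupby("breed_predicted")["rating"].transform("median")
master["rating"] = master["rating"].fillna(group_median)
```

Rows whose group has no rating at all would stay null, since the group median is itself missing.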

Test 5:

Storing, Analyzing, and Visualizing Data

Data store:
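
A sketch of storing the cleaned data as CSV. The file name twitter_archive_master.csv is a hypothetical choice, and a temporary folder is used only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the merged, cleaned dataframe.
master = pd.DataFrame({"tweet_id": ["1", "2"], "rating": [1.3, 1.2]})

# Hypothetical file name, written to a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

# Read tweet_id back as text so long IDs are not converted to numbers.
restored = pd.read_csv(out_path, dtype={"tweet_id": str})
```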

Analyzing and Visualizing Data

Insight one : Top Dog type values based on the count

Among the categorized tweets, pupper has the highest count, and many tweets have no dog type assigned.
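
Counts like these could be obtained with value_counts, sketched here on made-up rows. Passing dropna=False keeps the uncategorized tweets in the result.

```python
import pandas as pd

# Made-up rows; None marks a tweet with no dog type.
master = pd.DataFrame({
    "dog_type": ["pupper", "pupper", "doggo", None, "pupper", None],
})

counts = master["dog_type"].value_counts(dropna=False)  # sorted, most frequent first
n_uncategorized = master["dog_type"].isna().sum()
```

Calling counts.plot(kind="bar") would give a bar chart of the same numbers, assuming matplotlib is installed.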

Insight two : Top dog name values based on the count

Insight three : Rating by dog type